Cloning a Repo

Luckily, for all of us, this course will not consist of me talking for 75 minutes. Rather, active learning components will be interspersed within the day. In general, we will spend some time talking about a topic and then I will give you time to work through data visualization or analysis.

To facilitate the active learning sessions, I’d recommend cloning the repo at the start of class. For instance, here is the repo for today https://github.com/Stat534/Lecture3. Using R Studio to create a local project is one way to do this, but you won’t have push access to my repo.

Data Viz Overview

Data visualization is an essential part of the statistical analysis process, both for exploratory analyses and summarizing findings.

Exploratory Data Analysis
  • Used for data quality checks

  • Help explore and understand the data

  • Typically, not seen by anyone else

Polished Data Visualization
  • Used to summarize data in presentations or papers

  • Should stand alone with appropriate titles, axes, labels, and captions

  • ggplot2 makes creating polished figures easier: cheat sheet

Spatial Data Viz Tools

There are many tools for creating spatial figures (GIS software, Tableau, etc…), but we will exclusively use R and the wide range of packages within it.

In particular, we will use:

  • ggplot2

  • ggmap

  • leaflet

  • RgoogleMaps

  • and many others…

Point Data: What is this?

Point Data: How about now?

Point Data: Is this better?

ggmap Code Overview

 mykey <- read_file('./google_api.txt')
  register_google(key = mykey)
  myMap <- get_map(location = c(lon = - 74,lat = 40.75),
                 source = "google",
                 maptype = "roadmap", crop = FALSE,
                 zoom = 11, api_key = mykey)

  ggmap(myMap) + 
    geom_point(aes(x=Lon, y=Lat), alpha=.03, size=.5, data=uber) + 
    labs(title = 'Location of Uber pickups on May 1, 2014 for NYC Destinations', 
    caption = 'source: https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city') + 
    xlab('') + ylab('') +
    theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())

Principles for Point Data

  1. Include useful background for appropriate context: there are several approaches for acquiring maps in R. Sometimes streets may be more useful, but in other situation a terrain image might be more relevant.
  2. With a point patterns, use transparency or heat map summaries to distinguish between areas of higher and lower intensity.
  3. Include useful titles, labels, and where appropriate, captions (all figures). These figures should stand alone.
  4. Sources should be cited in figures.

Active Learning Exercise: Seattle Police Calls Data Viz

seattle <- read_csv('./SeattlePolice.csv')
## Parsed with column specification:
## cols(
##   CAD.Event.Number = col_double(),
##   Event.Clearance.Description = col_character(),
##   Event.Clearance.SubGroup = col_character(),
##   Event.Clearance.Group = col_character(),
##   Census.Tract = col_double(),
##   Longitude = col_double(),
##   Latitude = col_double(),
##   Year = col_integer(),
##   Month = col_integer(),
##   Day = col_integer()
## )

Cartography

Distance Calculations

A collaborator suggests that there may a spatial relationship between the police calls in the Seattle Data Set. How would you calculate the distance between those points?

Event.Clearance.Description Longitude Latitude
TRAFFIC (MOVING) VIOLATION -122.3392 47.61372
SHOPLIFT -122.3384 47.61824
DISTURBANCE, OTHER -122.3474 47.61271

Distance Calculations, follow up

A collaborator suggests that there may a spatial relationship between the police calls in the Seattle Data Set. How would you calculate the distance between those points?

Event.Clearance.Description Longitude Latitude
TRAFFIC (MOVING) VIOLATION -122.3392 47.61372
SHOPLIFT -122.3384 47.61824
DISTURBANCE, OTHER -122.3474 47.61271

As a follow up:

  • what are the units for your distances?
  • are they consistent across latitude?

Map Projections

Map Projections

Map projections are a representation of a surface on a plane. Specifically, functions are designed to map the geographical coordinate system \((\lambda, \phi)\) to rectangular or polar coordinates (\(x, y\)) such that:

\[x = f(\lambda, \phi), \; \; y = g(\lambda, \phi),\]

where \(f\) and \(g\) are functions that define the projection.

Mercator projections

A projection type we will use in this class is the Mercator projection. The Mercator projection gives the common square maps you are used to looking at.

\[ f(\lambda, \phi) = R\lambda, \; \; g(\lambda, \phi) = R * ln \left( tan \left( \frac{\pi}{4} + \frac{\phi}{2}\right) \right)\]

The Mercator projection typically includes a rectangular grid and locations are expressed in meters from the intersection of grid lines. Common terminology is to use “easting” and “northings” for the coordinates.

UTM Projections

The UTM projection divides the world into 60 vertical slices, known as zones.

Distances on the earth’s surface

According to Gauss’ Theorema Eggregrium in differential geometry, a planar map projection that preserves distances between points does not exist.

Distance Metrics

As a precursor to this class, I said “objects close in space tend to be more similar.” Mathematically, we will require precise distance measurements between points.

  • Euclidean distance: is the straightline distance between two points, recall the Pythagorean thereom. This works okay (with latitude and longitude) for points in a small spatial domain, such as New York City. However, the curvature of the earth causes distortions for larger areas.

  • geodesic distance: is the length of the arc on the Earth’s surface.

Geodesic distance

The geodesic distance in computed as \(D = R \phi,\) where \(R\) is the radius of the earth and \(\phi\) is an angle (in radians) such that:

\[\cos \phi = \sin\theta_1 \sin \theta_2 + \cos \theta_1 \cos \theta_2 cos(\lambda_1 - \lambda_2),\] where \(\theta_1\) and \(\theta_2\) are the latitude measurements and \(\lambda_1\) and \(\lambda_2\) are the longitude measurements for two points.

Thus

\[D = R \arccos[\sin\theta_1 \sin \theta_2 + \cos \theta_1 \cos \theta_2 cos(\lambda_1 - \lambda_2)]\]

Chordal Distance

Another alternative is to use what is known as the chordal distance, which is equivalent to the “burrowed through the earth” distance between two points on the Earth’s surface.

Let where \(x,\) \(y\), and \(z\) form a set of Cartesian coordinates with the origin at the center of the earth, the \(z\)-axis runs between the north and south poles.

Then the chordal distance can be calculated as the Euclidean distance between two vectors \(\boldsymbol{u}_1 = (x_1, y_1, z_1)\) and \(\boldsymbol{u}_1 = (x_1, y_1, z_1)\).

Distance Calculation Example

Calculate the distance between Chicago (41.8781° N, 87.6298° W) and Minneapolis (44.9778° N, 93.2650° W) using naive Euclidean, geodesic, and chordal measures. Note naive Euclidean can be computed by multiplying Euclidean distance (on radians) by R.

theta1 <- 41.88 * pi / 180 # in Radians
theta2 <- 44.89 * pi / 180
lambda1 <- 87.63 * pi / 180
lambda2 <- 93.22 * pi / 180
R <- 6371

Distance Calculation Solution

geodesic <- R * acos(sin(theta1)*sin(theta2) + 
 cos(theta1)*cos(theta2)*cos(lambda1 - lambda2))

euc <- sqrt((theta1 - theta2)^2 + (lambda1 - lambda2)^2) * R 

chord <- sqrt(
 (R * cos(theta1) * cos(lambda1) - R * cos(theta2) * cos(lambda2))^2 +
 (R * cos(theta1) * sin(lambda1) - R * cos(theta2) * sin(lambda2))^2 +
 (R * sin(theta1) - R * sin(theta2))^2)  

The end result is that the geodesic distance is 561.99 distance and the chordal distance 561.81 are very close, but the naive Euclidean 705.96 is quite a bit different.

Spatial Data in R

Spatial Data Structure

Spatial objects in R have two different structures:

  • Vector data: vector data consist of points, shapes and lines, all with defined borders.

  • Raster data: consists of a gridded set of pixels and often coincides with remote sensed data.

Plotting Shape Files

Raster data

A raster consists of a matrix that contains values for each pixel.

set.seed(01162019)
tmp.raster <- raster(ncol=5, nrow=5, ymx = 5, ymn = 1,xmx=5, xmn=1, crs = NA)
tmp.raster
## class       : RasterLayer 
## dimensions  : 5, 5, 25  (nrow, ncol, ncell)
## resolution  : 0.8, 0.8  (x, y)
## extent      : 1, 5, 1, 5  (xmin, xmax, ymin, ymax)
## coord. ref. : NA
values(tmp.raster) <- runif(25)

Raster data

Additional References